Abstract

Human action recognition is currently one of the most active research areas in computer vision. In this paper, we
present an effective entropy-based approach to recognizing human activities from video sequences. Our approach derives
a new descriptor directly from the Motion History Image (MHI), an efficient real-time representation of human action.
The local entropy of pixels in the MHI is computed to characterize the texture of the motion field, which helps to
match the moment-based features statistically. Evaluated on well-established benchmark data sets, the approach achieves
high recognition rates. On the popular Weizmann data set, we achieve a best accuracy of 98.9% in a leave-one-out
cross-validation procedure. On the well-known KTH data set, we evaluate the robustness of the proposed approach across
different scenarios. The new method is computationally efficient, robust to appearance variation, and straightforward
to apply, as it builds on well-established and easily understood concepts.

Introduction

In the computer vision community, human action recognition in videos has been studied extensively and has drawn
widespread attention in both academic and industrial research. It has become a key component of many computer
vision applications, such as video surveillance, human-computer interaction, video games, and multimedia retrieval.
Simply put, human action recognition attempts to identify automatically whether a person is walking, jumping,
running, or performing some other type of action. It is a challenging problem because, in practical scenarios,
obtaining a complete analytic description of an action is extremely hard.

Most existing methods for human action recognition fall into two categories according to the frame-sequence
descriptors they use. One category is based on sparse interest points [1-9]; the other is based on dense holistic
features [10-14, 28]. In both cases, the information over space and time is combined to form a global
representation, such as a bag of words or a space-time volume, and a classifier is then trained to label the
resulting representation.

Both kinds of descriptors have limitations. Dense descriptors are not robust to variations in viewpoint and
scale [15, 16], so they do not model actions well enough. Sparse descriptors, in turn, do not emphasize modeling the
background, although the background carries contextual information that is useful for recognizing human
actions [3, 5]. Moreover, many interesting human activities are characterized by a complex temporal composition of
simple actions. The Motion History Image (MHI) method recursively integrates the motion throughout a video action
sequence into a single image that represents the human activity. For recognition, higher-order moment features
computed from the template are statistically matched to trained models [17, 18]. The approach has been used within
several interactive virtual environments that require computer interpretation of the participant's motions.
Although the MHI method has proved effective in real-time constrained situations, it still has a number of
limitations related to its global image feature calculations and its specific label-based recognition.

In this paper, we address these limitations by extending the approach with a new descriptor that computes a local
(dense) pixel entropy field directly from the MHI to capture the texture information of the movement. Since the
MHI is not robust to appearance changes of the human silhouette, a further description is usually required that not
only captures most of the motion information but is also insensitive to covariate condition changes that affect
appearance. Shannon entropy has been used [19] to encode the randomness of pixel values in silhouette images
over a complete gait cycle. However, it generally considers only probability values between 0 and 1. To overcome
this, local entropy is investigated, which is one of the motivations of this work.

Based on local entropy, the new descriptor renders in a single image the randomness of pixel values in the MHI,
which contains the raw motion information. Dynamic body areas that undergo consistent relative motion during an
action lead to high entropy values, whereas areas that remain static correspond to low values. A Local Entropy
Descriptor (LEnD) thus captures both the global motion information and the local variation in texture appearance
while an action is performed. Compared with the original MHI [17] and the hierarchical MHI [20], LEnD keeps the
strength of temporal history templates and the compensation for the varying speeds that are common in articulated
human motion. At the same time, it avoids the aforementioned weaknesses of existing representations.

We demonstrate the effectiveness of the LEnD representation through extensive experiments on public
benchmark data sets [21, 22, 27]. Our results suggest that the proposed LEnD of MHI outperforms other MHI-related
methods as well as the baseline method. In the remainder of this paper, we first review the idea of MHI temporal
templates in the next section. We then present the construction of the Local Entropy Descriptor for MHI in detail
in the third section. In the fourth section, we introduce the data sets used in the evaluation and present
the experimental results. Finally, we present our conclusions in the last section.


Motion History Image 


The basic MHI framework was proposed for representing and recognizing human motions [17, 23]. The approach
aims to extract discriminative information from holistic patterns of motion in an accumulating way, and it
therefore pays little attention to structural features. The representation is based on the temporal template,
which was originally constrained to very particular domains, such as periodic or facial motion. The general MHI
template is targeted at representing arbitrary human movements. The advantage of the method is its compact yet
descriptive real-time representation, which captures a sequence of motions in a single static image. The
MHI is constructed by successively layering selected image regions over time using the simple update rule in
equation (1).


$$
MHI_{\delta}(x, y) =
\begin{cases}
\tau & \text{if } \varphi(I(x, y)) \neq 0 \\
0 & \text{else if } MHI_{\delta}(x, y) < \tau - \delta
\end{cases}
\qquad (1)
$$


where each pixel (x, y) in the MHI is marked with the current timestamp $\tau$ if the function $\varphi$ signals
object presence (or motion) in the current video image I(x, y). The remaining timestamps in the MHI are removed if
they are older than the decay value $\tau - \delta$. This update function is called for every new video frame
analyzed in the sequence.
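
The update rule of equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name `update_mhi` and the convention that a value of 0 means "no recorded motion" are assumptions made here.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau, delta):
    """Apply the update rule of equation (1) for one frame.

    mhi: float array of timestamps (0 is assumed to mean "no recorded motion").
    motion_mask: boolean array, True where phi signals motion.
    tau: current timestamp; delta: decay duration in the same units as tau.
    """
    mhi = mhi.copy()
    mhi[motion_mask] = tau                         # stamp moving pixels
    stale = (~motion_mask) & (mhi < tau - delta)   # older than the decay value
    mhi[stale] = 0
    return mhi
```

Calling the function once per frame with an increasing `tau` (for example, the frame number starting at 1) accumulates the template; pixels whose last motion is older than `delta` are cleared.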

The function $\varphi$ that selects a pixel location in the input image for inclusion in the MHI can be specified
arbitrarily. Since the template representation captures both the position and the temporal history of a moving
object, many possibilities for selecting regions of interest are applicable. Detectors may include background
subtraction, image differencing, optical flow, edges, stereo-depth silhouettes, flesh-colored regions, etc. With an
object selection process for $\varphi$, the representation can accommodate slowly moving regions (<1 pixel/frame)
that would otherwise be missed by image differencing or standard optical flow. For the results presented here, we
used a thresholded image differencing method.
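
As an illustration of one possible choice for $\varphi$, thresholded image differencing might look as follows. The function name `motion_mask` and the default threshold of 25 gray levels are hypothetical; the paper does not state the threshold it used.

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=25):
    """Return True where two grayscale frames differ by more than `threshold`.

    Frames are cast to a signed type first so that uint8 subtraction
    cannot wrap around.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold
```

The resulting boolean mask plays the role of $\varphi(I(x, y)) \neq 0$ in equation (1).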

To illustrate the result of MHI construction, Fig. 1 presents key frames from sequences of different human actions
in the Weizmann data set together with the corresponding (cumulative) MHIs. For display purposes, the timestamp
pixel values in the templates are linearly mapped to gray-level values 0-255. The decay parameter $\delta$ in
equation (1) is set to 10 (in frames) for the illustration in Fig. 1. The brightness of a pixel corresponds to its
recency in time; that is, brighter pixels carry more recent timestamps. Depending on the value chosen for the decay
parameter $\delta$, an MHI can represent a wide history of movement.
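
The gray-level mapping used for display can be sketched as below. The exact mapping is not given in the text, so this version assumes that timestamps in the window from $\tau - \delta$ to $\tau$ are stretched linearly onto 0-255 and that empty pixels (value 0) stay black; `mhi_to_gray` is a hypothetical name.

```python
import numpy as np

def mhi_to_gray(mhi, tau, delta):
    """Map timestamps in [tau - delta, tau] linearly to gray levels 0..255."""
    scaled = (mhi - (tau - delta)) / float(delta)
    scaled = np.clip(scaled, 0.0, 1.0)
    gray = np.round(scaled * 255).astype(np.uint8)
    gray[mhi == 0] = 0          # assumed convention: 0 marks "no motion"
    return gray
```

With this mapping the most recent motion appears white and motion about to decay appears nearly black, matching the description of Fig. 1.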



[Figure panels: side, skip, walk, wave1, wave2]

FIG. 1 EXAMPLES OF MHI (AT THE BOTTOM OF EACH PAIR) CORRESPONDING TO ORIGINAL FRAMES (AT THE TOP OF EACH PAIR)
FROM VIDEOS OF DIFFERENT KINDS OF ACTIONS IN THE WEIZMANN DATA SET

The initial application of MHIs to recognizing human movements [23] extracted several higher-order scale- and
translation-invariant moment features and statistically matched them to stored model examples using the
Mahalanobis distance. Although this method is successful in constrained situations with single and multiple
cameras, its limitation is that the moment features are calculated from the entire template and are therefore
sensitive to holistic appearance changes. Any occlusion of the body or error in the implementation of $\varphi$
can cause serious recognition failures. In addition, the method was limited to label-based recognition and could
not yield any information other than specific identity matches. This motivated us to consider a more localized
approach to motion analysis of the MHI.
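
For reference, matching a feature vector to stored models with the Mahalanobis distance can be sketched as follows. The moment features themselves are not computed here, and the names `mahalanobis` and `nearest_model` as well as the dictionary layout of the models are assumptions for illustration only.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance between x and a model with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def nearest_model(x, models):
    """models: dict mapping label -> (mean, cov). Returns the closest label."""
    return min(models, key=lambda label: mahalanobis(x, *models[label]))
```

Because the distance is computed on features of the whole template, a local disturbance such as an occlusion shifts every feature at once, which is the weakness described above.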

Local Entropy Descriptor 

According to equation (1), the MHI records the history of motion by layering the $\varphi$ regions over time.
This is apparent in the temporal templates shown in Fig. 1, where the progression of movement is captured
in the dark-to-light intensity gradients. The unique motion pattern of each human action category thus gives its
MHIs a characteristic texture.

Shannon entropy is a statistical measure of randomness that can be used to characterize the texture of the input 
image. It measures the uncertainty associated with a random variable. Considering the intensity value of the 
silhouettes at a fixed pixel location as a discrete random variable, the entropy of this variable can be computed as 
in equation (2). 

$$
H(x, y) = -\sum_{k=1}^{K} p_k(x, y) \log_2 p_k(x, y) \qquad (2)
$$

where x, y are the pixel coordinates and $p_k(x, y)$ is the probability that the pixel takes on the k-th value.
Pun [24] used Shannon's concept to define the entropy of an image, assuming that an image is entirely represented
by its gray-level histogram. This approach based on the Shannon entropy function has led to successful segmentation
algorithms. In this entropy, the measure of ignorance, or information gain, is taken as $\log(1/p_i)$.
Statistically, however, ignorance can be better represented by $(1 - p_i)$ than by $(1/p_i)$. Rényi entropy is an
improvement over Shannon entropy. The exponential entropy function of Pal and Pal has been found to be more
effective for image-specific applications, because both the upper and lower bounds of the exponential entropic
measure are defined, unlike the logarithmic measure, which is undefined when the occurrence probability of an event
is zero or when the events are equally likely and the number of events is large. To overcome this problem, a new
form of entropy calculation, termed the local entropy descriptor (LEnD), is presented in this work.
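
The difference between the logarithmic and exponential measures can be checked numerically. The sketch below implements equation (2) for a single probability vector and, as an assumption not taken from this paper, the commonly cited form of the Pal and Pal exponential entropy, $\sum_i p_i e^{1 - p_i}$; the function names are illustrative.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def exponential_entropy(p):
    """Exponential entropy sum(p * exp(1 - p)); assumed form, see lead-in."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.exp(1.0 - p)))
```

For a uniform distribution over n outcomes, the Shannon value grows as $\log_2 n$, while the exponential value equals $e^{1 - 1/n}$ and therefore stays below $e$ however large n becomes.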

The LEnD of an image is an entropy image of the same size as the input. First, the Shannon entropy of the
neighborhood around each pixel in the input image is computed and taken as the gray level of the corresponding
pixel in the output image. The key point of this stage is to choose an appropriate neighborhood size, which is
generally related to the size of the input image. In this work, we use an 8-by-8 square area. Second, the LEnD is
normalized for the convenience of the recognition step. The input MHIs usually do not share the same foreground
region size because of the diversity of human activities, so a uniform LEnD of MHIs is necessary for further
processing by classification algorithms. Fig. 2 shows some examples of LEnD computed from MHIs of different actions
in the Weizmann data set.
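
The two steps above can be sketched as follows. This is a plain NumPy illustration under several assumptions that the paper does not specify: the input is an 8-bit gray-level MHI, borders are handled by reflection, and normalization is interpreted as cropping to the foreground bounding box followed by nearest-neighbor resizing to a fixed, hypothetical output size.

```python
import numpy as np

def local_entropy(image, size=8):
    """Shannon entropy (in bits) of the size-by-size neighborhood of each pixel."""
    image = np.asarray(image, dtype=np.uint8)
    before = size // 2              # an even window cannot be centered exactly
    after = size - 1 - before
    padded = np.pad(image, ((before, after), (before, after)), mode="reflect")
    out = np.zeros(image.shape, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + size, c:c + size]
            counts = np.bincount(window.ravel(), minlength=256)
            p = counts[counts > 0] / window.size
            out[r, c] = -np.sum(p * np.log2(p))
    return out

def normalize_lend(entropy, foreground, out_shape=(64, 64)):
    """Crop to the foreground bounding box and resize by nearest neighbor.

    Assumes `foreground` (e.g. mhi > 0) contains at least one True pixel.
    """
    rows = np.flatnonzero(foreground.any(axis=1))
    cols = np.flatnonzero(foreground.any(axis=0))
    crop = entropy[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    ri = np.arange(out_shape[0]) * crop.shape[0] // out_shape[0]
    ci = np.arange(out_shape[1]) * crop.shape[1] // out_shape[1]
    return crop[np.ix_(ri, ci)]
```

With an 8-by-8 window the entropy of a pixel lies between 0 bits (a flat region) and 6 bits (all 64 values distinct). The explicit loops keep the sketch readable; a practical implementation would use a vectorized or library routine.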



[Figure panels, row 1: bend, jack, jump, pjump, run; row 2: side, skip, walk, wave1, wave2]

FIG. 2 EXAMPLES OF LEND (AT THE BOTTOM OF EACH PAIR) CORRESPONDING TO MHIS (AT THE TOP OF EACH PAIR)
CONSTRUCTED FROM DIFFERENT KINDS OF ACTIONS IN THE WEIZMANN DATA SET


It can be clearly seen that the texture pattern of the MHIs is encoded by local entropy values in the LEnD images.
These patterns help to decrease the intra-class difference, which is a challenging problem in human action
recognition. Furthermore, most of the temporal motion information remains in the LEnD and provides
significant inter-class variation. Note that Fig. 2 illustrates only one frame extracted from each action video.
In the later experiments, several key frames are combined for every action category to supply more
discriminative power.

Experiments

Our evaluation experiments were performed on two of the most popular human action data sets, namely the Weizmann
data set [21, 22] (see Fig. 1) and the KTH data set [27] (see Fig. 5).

Weizmann Dataset 

This data set contains 90 videos of 10 different actions performed by 9 different actors. Fig. 1 shows some
examples of these actions. Since the extracted silhouettes are available together with the original video sequences
of the Weizmann data set, no background subtraction or other data pre-processing is required.
The silhouette masks provided cover action samples including "run", "walk", "skip", "bend", "wave", etc. All
actions were recorded from the same viewpoint in a controlled environment.

Following the approach described above, the MHI temporal templates are first constructed for each action sequence.
The proposed LEnDs are then calculated from the extracted MHIs and normalized. During the calculation of LEnDs, we
varied the number of key frames included in the final feature extraction. In Fig. 3, we evaluate the sensitivity of
LEnD to the number of key frames $Num_{key}$ by plotting the classification error rate as a function of
$Num_{key}$. Testing was performed using leave-one-out cross validation: each video sequence in turn is removed
entirely from the database, while the other action sequences of the same person remain. With the nearest neighbor
classifier, the removed sequence serves as the test sequence and all the remaining sequences are used as training
samples.
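
The leave-one-out protocol with a nearest neighbor classifier can be sketched as below. The Euclidean metric and the function name `loo_nearest_neighbor` are assumptions; the paper does not state which distance was used on the LEnD features.

```python
import numpy as np

def loo_nearest_neighbor(features, labels):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier.

    features: (n_samples, n_dims) array, one flattened descriptor per video.
    labels: length n_samples sequence of class labels.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    # Pairwise squared Euclidean distances between all samples.
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)     # the held-out sample may not match itself
    predicted = y[np.argmin(d, axis=1)]
    return float(np.mean(predicted != y))
```

Each row of the distance matrix corresponds to one held-out video, so a single pass yields the error rate over all folds.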



FIG. 3 EVALUATION OF LEND WITH VARYING KEY FRAME NUMBERS 

As shown in Fig. 3, the error rate of the LEnD method depends on the number of key frames chosen for feature
extraction when $Num_{key}$ is between 1 and 7. When more than 7 frames are selected, however, the error rate tends
to be stable. This result is similar to what is reported in [12], which found that 1-7 frames are sufficient for
basic action recognition.

Based on the results shown in Fig. 3, we performed the classification experiments with $Num_{key}$ set to 7. In
this setting, we obtained a mean classification accuracy of 98.9% over all ten actions. The confusion matrix for
all actions is shown in Fig. 4. Only 1 of the 90 videos was misclassified in these experiments: a "run" action was
confused with "skip". This is a reasonable confusion considering the small difference in the texture patterns
encoded in the LEnDs of the MHIs of these two actions.
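
The reported figure follows directly from the counts: one error in 90 videos gives 89/90. The sketch below rebuilds a confusion matrix with that single run-to-skip error; the class names are taken from the figure panel labels, and which particular "run" video was misclassified is not stated, so the first one is used arbitrarily.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t], index[p]] += 1
    return cm

classes = ["bend", "jack", "jump", "pjump", "run",
           "side", "skip", "walk", "wave1", "wave2"]
true_labels = [c for c in classes for _ in range(9)]   # 9 actors per action
predicted = list(true_labels)
predicted[true_labels.index("run")] = "skip"           # the single reported error
cm = confusion_matrix(true_labels, predicted, classes)
accuracy = np.trace(cm) / cm.sum()
print(f"{100 * accuracy:.1f}%")   # prints 98.9%
```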

For comparison with other feature descriptors, we performed another set of experiments with MHI [17], hierarchical
MHI [20], and other silhouette-based representations [29-31]. The results, obtained under the same leave-one-out
validation and parameter settings, are listed in Table 1 alongside established state-of-the-art approaches. As can
be seen, the proposed LEnD method matches the best accuracy in the table and outperforms the remaining methods.
This shows that constructing LEnD representations of MHI can provide reliable recognition rates in classifying
human actions.

FIG. 4 CONFUSION TABLE FOR WEIZMANN DATA SET (AVERAGE ACCURACY 98.9%)

TABLE 1 COMPARISON WITH OTHER METHODS ON WEIZMANN DATA SET


Method                   Accuracy
MHI [17]                 96.7%
hierarchical MHI [20]    97.8%
method in [29]           97.8%
method in [30]           98.9%
method in [31]           96.8%
this paper               98.9%


KTH Dataset 

This data set consists of 600 low-resolution (120x160 pixel) videos depicting 25 persons, each performing six
actions [27]: walking, jogging, running, boxing, hand waving, and hand clapping. Four different scenarios were
recorded: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3), and indoors
(s4), as illustrated in Fig. 5.


[Figure columns: Walking, Jogging, Running, Boxing, Hand waving, Hand clapping]

FIG. 5 VIDEO FRAMES OF KTH DATA SET FOR THE FOUR DIFFERENT SCENARIOS


For the KTH data set, we performed the two suggested tests: cross-scenario evaluation and leave-one-out
evaluation on the whole data set.

For the first setup, we trained on the subset {s1}, which contains video sequences from the outdoor
environment. Each of the remaining three subsets {s2}, {s3}, and {s4} was then used as the test set in turn, which
allows us to analyze the influence of the different scenarios. Since the proposed LEnD method is based on
silhouette images, the scenario can affect the quality of the silhouettes. Due to shadows and imperfect background
subtraction,





www.seipub.org/mt


Multimedia Technology (MT) Volume 4, 2015 


these silhouettes contain "leaks" and "intrusions". With the training-test pairs {s1, s2}, {s1, s3}, and {s1, s4},
we can evaluate the robustness of our approach to variation in scale, clothes, and environment. We also
performed an experiment on subset {s1} alone using leave-one-out cross validation to provide a baseline for the
evaluation. As a result, we obtained a baseline average classification accuracy of 95.3% over all six actions. For
the cross-scenario results, the average accuracies are 93.3% for {s1, s2}, 92.0% for {s1, s3}, and 94.7% for
{s1, s4}. The corresponding confusion tables are shown in Fig. 6.

(A) {S1, S1} AVERAGE ACCURACY OF 95.3% (B) {S1, S2} AVERAGE ACCURACY OF 93.5%
(C) {S1, S3} AVERAGE ACCURACY OF 92.0% (D) {S1, S4} AVERAGE ACCURACY OF 94.7%

FIG. 6 CONFUSION TABLES FOR DIFFERENT TRAINING-TEST PAIRS
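
The cross-scenario protocol can be sketched as follows: a model is built from one scenario and evaluated separately on each of the others. The nearest neighbor classifier with a Euclidean metric is an assumption carried over from the Weizmann experiments, since the classifier used on KTH is not stated; the function names and the dictionary layout of `data` are hypothetical.

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y):
    """Accuracy of a 1-nearest-neighbor classifier trained on one set, tested on another."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    d = np.sum((test_X[:, None, :] - train_X[None, :, :]) ** 2, axis=-1)
    predicted = train_y[np.argmin(d, axis=1)]
    return float(np.mean(predicted == np.asarray(test_y)))

def cross_scenario(data, train_key="s1"):
    """data: dict mapping scenario name -> (features, labels).

    Returns the accuracy on every scenario other than the training one.
    """
    train_X, train_y = data[train_key]
    return {key: nn_accuracy(train_X, train_y, X, y)
            for key, (X, y) in data.items() if key != train_key}
```

Keeping the training scenario fixed isolates the effect of each change (scale, clothes, environment) on the descriptor.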

As can be seen in Fig. 6, leg actions and arm actions are clearly separated, and the two action classes that
are most difficult to recognize are jogging and running. This result is similar to what is reported in [27].
Importantly, scenario differences affect the classification rate, especially clothing variation (Fig. 6(c)).
Furthermore, Fig. 6(c) shows a small confusion between the waving and clapping classes, which does not occur in the
other cross-scenario settings. This is reasonable, since some types of clothes can occlude or follow a person's
hand motion or other kinds of actions. Even in these cases, however, the proposed method obtains high
classification rates, which shows that it is not sensitive to most scenario changes.


TABLE 2 RESULTS ON THE ENTIRE KTH DATA SET

Action Class     Accuracy
Boxing           96.2%
Hand clapping    93.1%
Hand waving      94.7%
Jogging          93.6%
Running          93.1%
Walking          93.7%


For the second setup, we use all scenarios to evaluate our approach. In these experiments, we adopted the classic
leave-one-out training method, i.e., 24 persons for training and the remaining one for testing, and repeated the
experiments 25 times to obtain the average results. Note that in each action class, each person has four video
sequences, one per scenario. The results are shown in Table 2.
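
The person-wise splitting described above can be sketched with a small generator. The name `leave_one_person_out` and the representation of the data set as a flat list of person identifiers (one per video) are assumptions for illustration.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out, train_indices, test_indices), one fold per distinct person."""
    for held_out in sorted(set(person_ids)):
        train = [i for i, p in enumerate(person_ids) if p != held_out]
        test = [i for i, p in enumerate(person_ids) if p == held_out]
        yield held_out, train, test

# 25 persons, each with 6 actions x 4 scenarios = 24 videos (600 in total).
person_ids = [person for person in range(25) for _ in range(24)]
folds = list(leave_one_person_out(person_ids))
```

Splitting by person rather than by video ensures that no sequence of the test subject is seen during training in any of the 25 folds.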

According to Table 2, the average accuracy is 94.1%, which is better than the result initially reported on the KTH
data set in [27]. In addition, we performed experiments using the direct MHI and the hierarchical MHI for
comparison; the corresponding average accuracies are 90.3% and 91.5%. Hence, the experimental results on
both data sets demonstrate the superiority and robustness of the proposed method.
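
The 94.1% average can be reproduced from the per-class values in Table 2, assuming an unweighted mean over the six classes:

```python
per_class = {
    "Boxing": 96.2,
    "Hand clapping": 93.1,
    "Hand waving": 94.7,
    "Jogging": 93.6,
    "Running": 93.1,
    "Walking": 93.7,
}
average = sum(per_class.values()) / len(per_class)
print(f"{average:.1f}%")   # prints 94.1%
```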

Conclusions 

In this paper, we introduced a Local Entropy Descriptor (LEnD), which characterizes the local texture pattern of
human actions using the Motion History Image (MHI) as input. Using this descriptor for MHI, we extracted a
set of dynamical and region metric invariants of the temporal templates of the dynamical system and used them for
action recognition. The feasibility and potential merits of action recognition with this descriptor were validated
experimentally on real videos of human actions.